Abstract

In  real-world  planning  problems,  we  must  reason  not  only  about  our 
own  goals,  but  about  the  goals  of  other  agents  with  which  we  may  interact. 
Often  these  agents’  goals  are  neither  completely  aligned  with  our  own  nor 
directly  opposed  to  them.  Instead  there  are  opportunities  for  cooperation: 
by  joining  forces,  the  agents  can  all  achieve  higher  utility  than  they  could 
separately. But, in order to cooperate, the agents must negotiate a mutually
acceptable plan from among the many possible ones, and each agent
must  trust  that  the  others  will  follow  their  parts  of  the  deal.  Research  in 
multi-agent  planning  has  often  avoided  the  problem  of  making  sure  that 
all  agents  have  an  incentive  to  follow  a  proposed  joint  plan.  On  the  other 
hand,  while  game  theoretic  algorithms  handle  incentives  correctly,  they 
often  don’t  scale  to  large  planning  problems.  In  this  paper  we  attempt  to 
bridge  the  gap  between  these  two  lines  of  research:  we  present  an  efficient 
game-theoretic  approximate  planning  algorithm,  along  with  a  negotiation 
protocol  which  encourages  agents  to  compute  and  agree  on  joint  plans 
that  are  fair  and  optimal  in  a  sense  defined  below.  We  demonstrate  our 
algorithm  and  protocol  on  two  simple  robotic  planning  problems. 


Keywords:  Stochastic  Games,  subgame-perfect  Nash  equilibria,  Q  Learning. 


1  INTRODUCTION 


We  model  the  multi-agent  planning  problem  as  a  general-sum  stochastic  game 
with  cheap  talk:  the  agents  observe  the  state  of  the  world,  discuss  their  plans 
with  each  other,  and  then  simultaneously  select  their  actions.  The  state  and 
actions  determine  a  one-step  reward  for  each  player  and  a  distribution  over  the 
world’s  next  state,  and  the  process  repeats. 

While  talking  allows  the  agents  to  coordinate  their  actions,  it  cannot  by  itself 
solve  the  problem  of  trust:  the  agents  might  lie  or  make  false  promises.  So,  we 
are  interested  in  planning  algorithms  that  find  subgame-perfect  Nash  equilibria. 
In  a  subgame-perfect  equilibrium,  every  deviation  from  the  plan  is  deterred 
by  the  threat  of  a  suitable  punishment,  and  every  threatened  punishment  is 
believable.  To  find  these  equilibria,  planners  must  reason  about  their  own  and 
other  agents’  incentives  to  deviate:  if  other  agents  have  incentives  to  deviate 
then  I  can’t  trust  them,  while  if  I  have  an  incentive  to  deviate,  they  can’t  trust 
me. 

In  a  given  game  there  may  be  many  subgame-perfect  equilibria  with  widely 
differing  payoffs:  some  will  be  better  for  some  agents,  and  others  will  be  better 
for  other  agents.  It  is  generally  not  feasible  to  compute  all  equilibria  [1],  and 
even  if  it  were,  there  would  be  no  obvious  way  to  select  one  to  implement.  It 
does  not  make  sense  for  the  agents  to  select  an  equilibrium  without  consulting 
one  another:  there  is  no  reason  that  agent  A’s  part  of  one  joint  plan  would 
be  compatible  with  agent  B’s  part  of  another  joint  plan.  Instead  the  agents 
must  negotiate,  computing  and  proposing  equilibria  until  they  find  one  which 
is  acceptable  to  all  parties. 

This  paper  describes  a  planning  algorithm  and  a  negotiation  protocol  which 
work  together  to  ensure  that  the  agents  compute  and  select  a  subgame-perfect 
Nash equilibrium which is both approximately Pareto-optimal (that is, its value
to any single agent cannot be improved very much without lowering the value to
another agent) and approximately fair (that is, near the so-called Nash
bargaining  point).  Neither  the  algorithm  nor  the  protocol  is  guaranteed  to  work 
in  all  games;  however,  they  are  guaranteed  correct  when  they  are  applicable, 
and  applicability  is  easy  to  check.  In  addition,  our  experiments  show  that  they 
work  well  in  some  realistic  situations.  Together,  these  properties  of  fairness, 
enforceability, and Pareto optimality form a strong solution concept for a stochastic
game. The use of this definition is one characteristic that distinguishes
our  work  from  previous  research:  ours  is  the  first  efficient  algorithm  that  we 
know  of  to  use  such  a  strong  solution  concept  for  stochastic  games. 

Our  planning  algorithm  performs  dynamic  programming  on  a  set-based  value 
function: for $P$ players, at a state $s$, $V \in \mathcal{V}(s) \subset \mathbb{R}^P$ is an estimate of the value
the players can achieve. We represent $\mathcal{V}(s)$ by sampling points on its convex
hull. This representation is conservative, i.e., guarantees that we find a subset of
the true $\mathcal{V}^*(s)$. Based on the sampled points we can efficiently compute one-step
backups  by  checking  which  joint  actions  are  enforceable  in  an  equilibrium. 

Our negotiation protocol is based on a multi-player version of Rubinstein's
bargaining game. Players together enumerate a set of equilibria, and then take
turns proposing an equilibrium from the set. Until the players agree, the protocol
ends with a small probability $\epsilon$ after each step and defaults to a low-payoff
equilibrium; the fear of this outcome forces players to make reasonable offers.


2  BACKGROUND 

2.1  STOCHASTIC  GAMES 

A  stochastic  game  represents  a  multi-agent  planning  problem  in  the  same  way 
that  a  Markov  Decision  Process  [2]  represents  a  single-agent  planning  problem. 
As  in  an  MDP,  transitions  in  a  stochastic  game  depend  on  the  current  state  and 
action.  Unlike  MDPs,  the  current  (joint)  action  is  a  vector  of  individual  actions, 
one for each player. More formally, a general-sum stochastic game $G$ is a tuple
$(S, s_{\text{start}}, P, A, T, R, \gamma)$. $S$ is a set of states, and $s_{\text{start}} \in S$ is the start state.
$P$ is the number of players. $A = A_1 \times A_2 \times \ldots \times A_P$ is the finite set of joint
actions. We deal with fully observable stochastic games with perfect monitoring,
where all players can observe previous joint actions. $T : S \times A \mapsto \mathcal{P}(S)$ is the
transition function, where $\mathcal{P}(S)$ is the set of probability distributions over $S$.
$R : S \times A \mapsto \mathbb{R}^P$ is the reward function. We will write $R_p(s, a)$ for the $p$th
component of $R(s, a)$. $\gamma \in [0, 1)$ is the discount factor. Player $p$ wants to
maximize her discounted total value for the observed sequence of states and
joint actions $s_1, a_1, s_2, a_2, \ldots$, namely $V_p = \sum_{t=1}^{\infty} \gamma^{t-1} R_p(s_t, a_t)$. A stationary policy for
player $p$ is a function $\pi_p : S \mapsto \mathcal{P}(A_p)$. A stationary joint policy is a vector
of policies $\pi = (\pi_1, \ldots, \pi_P)$, one for each player. A nonstationary policy for
player $p$ is a function $\pi_p : \left(\bigcup_{t=0}^{\infty} (S \times A)^t \times S\right) \mapsto \mathcal{P}(A_p)$ which takes a history
of  states  and  joint  actions  and  produces  a  distribution  over  player  p’s  actions; 
we  can  define  a  nonstationary  joint  policy  analogously.  For  any  nonstationary 
joint  policy,  there  is  a  stationary  policy  that  achieves  the  same  value  at  every 
state  [3]. 
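
As a small illustration of the discounted objective above, the following sketch evaluates $V_p$ for every player on a finite, observed prefix of a trajectory. The function name and the list-of-tuples reward format are assumptions made for this example, and truncating the infinite sum to the observed prefix is a simplification.

```python
def discounted_value(rewards, gamma):
    """Discounted total value for each player.

    rewards[t] is the reward vector R(s_t, a_t) (one entry per player) observed
    at step t, with t counted from 0, so step t is weighted by gamma**t
    (this matches gamma**(t-1) when steps are counted from 1).
    """
    num_players = len(rewards[0])
    return tuple(
        sum(gamma ** t * r[p] for t, r in enumerate(rewards))
        for p in range(num_players)
    )


# Two players, two observed steps, discount factor 0.5.
example = discounted_value([(1, 0), (1, 2)], 0.5)
```

Here player 1 receives 1 + 0.5 * 1 and player 2 receives 0 + 0.5 * 2.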

The value function $V_p^{\pi} : S \mapsto \mathbb{R}$ gives expected values for player $p$ under joint
policy $\pi$. The value vector at state $s$, $V^{\pi}(s)$, is the vector with components
$V_p^{\pi}(s)$. (For a nonstationary policy $\pi$ we will define $V_p^{\pi}(s)$ to be the value if $s$
were the start state, and $V_p^{\pi}(h)$ to be the value after observing history $h$.) A
vector $V$ is feasible at state $s$ if there is a $\pi$ for which $V^{\pi}(s) = V$, and we will
say that $\pi$ achieves $V$.

We will assume public randomization: the agents can sample from a desired
joint action distribution in such a way that everyone can verify the outcome. If
public randomization is not directly available, there are cryptographic protocols
which can simulate it [4]. This assumption means that the set of feasible value
vectors  is  convex,  since  we  can  roll  a  die  at  the  first  time  step  to  choose  from  a 
set  of  feasible  policies. 

2.2  REPEATED  BATTLE  OF  THE  SEXES


One well-known stochastic game that can illustrate many of the concepts we've
presented is called Repeated Battle of the Sexes or RBoS. The shaded area in
Fig. 1 illustrates the set of feasible value vectors for this game, which has one
state, two players, and two actions for each player, with discount factor $\gamma = 0.99$
and  reward  function 

(1) 

In  Eq.  1,  the  first  player’s  action  determines  a  row  of  the  table  and  the 
second  player’s  action  determines  a  column.  The  corresponding  entry  lists  the 
payoffs  to  players  one  and  two  in  that  order.  Stochastic  games  with  only  one 
state,  such  as  RBoS,  are  called  repeated  games. 

2.3  EQUILIBRIA 

While optimal policies for MDPs can be determined exactly via various algorithms
such as linear programming [2], it isn't clear what it means to find an
optimal policy for a general-sum stochastic game. So, rather than trying to
determine  a  unique  optimal  policy,  we  will  define  a  set  of  reasonable  policies: 
the  Pareto-dominant  subgame-perfect  Nash  equilibria. 

A (possibly nonstationary) joint policy $\pi$ is a Nash equilibrium if, for each
individual player, no unilateral deviation from the policy would increase that
player's expected value for playing the game. Nash equilibria can contain incredible
threats, that is, threats which the agents have no intention of following
through on. To remove this possibility, we can define the subgame-perfect Nash
equilibria. A policy $\pi$ is a subgame-perfect Nash equilibrium if it is a Nash
equilibrium  in  every  possible  subgame:  that  is,  if  there  is  no  incentive  for  any 
player  to  deviate  after  observing  any  history  of  joint  actions. 

Finally, consider two policies $\pi$ and $\phi$. If $V_p^{\pi}(s_{\text{start}}) \ge V_p^{\phi}(s_{\text{start}})$ for all
players $p$, and if $V_p^{\pi}(s_{\text{start}}) > V_p^{\phi}(s_{\text{start}})$ for at least one $p$, then we will say that
$\pi$ Pareto dominates $\phi$. A policy which is not Pareto dominated by any other
policy  is  Pareto  optimal. 
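
The dominance test can be written directly on value vectors. The sketch below is illustrative only: the function names and the sample vectors are assumptions, and each policy is represented just by its start-state value vector.

```python
def pareto_dominates(v, u):
    """True if value vector v Pareto dominates u: at least as good for every
    player and strictly better for at least one."""
    return all(a >= b for a, b in zip(v, u)) and any(a > b for a, b in zip(v, u))


def pareto_front(vectors):
    """Vectors from the list that no other vector in the list dominates."""
    return [v for v in vectors if not any(pareto_dominates(u, v) for u in vectors)]


# Hypothetical start-state value vectors for three policies.
front = pareto_front([(4, 3), (3, 4), (2, 2)])
```

In this example the vector (2, 2) is dominated by both of the others, while neither (4, 3) nor (3, 4) dominates the other.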

RBoS  has  three  stationary  subgame-perfect  Nash  equilibria,  whose  value 
vectors are indicated with o in Fig. 1.¹ The equilibrium marked with both o
and  x  is  Pareto  dominated  by  the  other  two  equilibria  (marked  with  o  only),  but 
neither  of  the  latter  two  equilibria  dominates  the  other.  The  top  right  border 
of  the  feasible  set  (red  where  color  is  available)  corresponds  to  the  set  of  Pareto 
optimal  policies. 

¹ These equilibria are: always play $a_1, a_1$; always play $a_2, a_2$; and randomize with $P(a_1) = \tfrac{4}{7}$
for player 1 and $P(a_1) = \tfrac{3}{7}$ for player 2.

Figure 1: Illustration of feasible values, safety values, equilibria, and the folk
theorem  for  RBoS. 


2.4  RELATED  WORK 

Littman and Stone [5] give an algorithm for finding Nash equilibria in two-player
repeated games. Hansen et al. [6] show how to eliminate very-weakly-dominated
strategies in partially observable stochastic games. Doraszelski and
Judd  [7]  show  how  to  compute  Markov  perfect  equilibria  in  continuous-time 
stochastic  games.  The  above  papers  use  solution  concepts  much  weaker  than 
Pareto-dominant  subgame-perfect  equilibrium,  and  do  not  address  negotiation 
and  coordination.  Perhaps  the  closest  work  to  the  current  paper  is  by  Brafman 
and Tennenholtz [8]: they present learning algorithms which, in repeated self-play,
find Pareto-dominant (but not subgame-perfect) Nash equilibria in matrix
and  stochastic  games.  By  contrast,  we  consider  a  single  play  of  our  game, 
but  allow  “cheap  talk”  beforehand.  And,  our  protocol  encourages  arbitrary 
algorithms  to  agree  on  Pareto-dominant  equilibria,  while  their  result  depends 
strongly  on  the  self-play  assumption. 


2.4.1  FOLK  THEOREMS 

In  any  game,  each  player  can  guarantee  herself  an  expected  discounted  value 
regardless of what actions the other players take. We call this value the safety
value.  Suppose  that  there  is  a  stationary  subgame-perfect  equilibrium  which 
achieves  the  safety  value  for  both  players;  call  this  the  safety  equilibrium  policy. 

Figure 2: Equilibria of a Rubinstein game with $\gamma = 0.8$. Shaded area shows
feasible value vectors $(U_1(x), U_2(x))$ for outcomes $x$. Right-hand circle corresponds
to equilibrium when player 1 moves first; left-hand circle, when player 2
moves first. Nash point is indicated by O.


Suppose that, in a repeated game, some stationary policy $\pi$ is better for
both players than the safety equilibrium policy. Then we can build a subgame-perfect
equilibrium with the same payoff as $\pi$: start playing $\pi$, and if someone
deviates, switch to the safety equilibrium policy. So long as $\gamma$ is sufficiently large,
no rational player will want to deviate. This is the folk theorem for repeated
games: any feasible value vector which is strictly better than the safety values
corresponds to a subgame-perfect Nash equilibrium [9]. (The proof is slightly
more complicated if there is no safety equilibrium policy, but the theorem holds
for  any  repeated  game.) 
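
The "sufficiently large" condition can be made concrete for the simplest case. The sketch below assumes a player who earns a constant per-step reward `coop` under the cooperative policy, could earn `deviate` for one step by defecting, and earns `punish` per step once play switches to the safety equilibrium. These names and the sample numbers are hypothetical, and the one-shot comparison is a simplification of the full argument.

```python
def grim_trigger_holds(coop, deviate, punish, gamma):
    """True if following the plan forever is at least as good as deviating
    once and then being held to the punishment payoff forever."""
    follow = coop / (1 - gamma)
    cheat = deviate + gamma * punish / (1 - gamma)
    return follow >= cheat


def min_discount(coop, deviate, punish):
    """Smallest gamma for which the plan is enforceable.

    Rearranging coop/(1-g) >= deviate + g*punish/(1-g) gives
    g >= (deviate - coop) / (deviate - punish); assumes deviate > punish.
    """
    return max(0.0, (deviate - coop) / (deviate - punish))


threshold = min_discount(coop=3, deviate=5, punish=1)
```

With these hypothetical payoffs the threshold is 0.5, so the plan is enforceable at a discount factor of 0.6 but not at 0.4.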

There  is  also  a  folk  theorem  for  general  stochastic  games  [3].  This  theorem, 
while useful, is not strong enough for our purposes: it only covers discount
factors $\gamma$ which are so close to 1 that the players don't care which state they wind
up in after a possible deviation. In most practical stochastic games, discount
factors this high are unreasonably patient. When $\gamma$ is significantly less than 1,
the set of equilibrium vectors can change in strange ways as we change $\gamma$ [10].

In RBoS, each player can guarantee herself an expected reward of
$\min\{\tfrac{4}{7} \cdot 3, \tfrac{3}{7} \cdot 4\} = \tfrac{12}{7}$ on each step. This level of reward is the safety value (dashed lines
in Fig. 1). It happens that there is a stationary subgame-perfect equilibrium
which achieves the safety value for both players; this is the safety equilibrium
policy for RBoS. This disagreement policy is for each player to play her less
preferred action $\tfrac{3}{7}$ of the time and her more preferred action $\tfrac{4}{7}$ of the time.
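
The safety value can be checked with exact arithmetic. The payoff table of Eq. 1 is not reproduced here, so the sketch below assumes the usual Battle of the Sexes form: payoffs (4, 3) when both play $a_1$, (3, 4) when both play $a_2$, and (0, 0) otherwise. The helper `maximin_2x2` is a hypothetical name; it finds the mixing probability that maximizes a player's worst-case one-step reward.

```python
from fractions import Fraction


def maximin_2x2(M):
    """Safety value of a 2x2 game for the player whose payoffs are M[own][other].

    Returns (p, value): the probability p of playing her first action that
    maximizes her worst-case expected payoff, and that payoff. The optimum of
    the minimum of two linear functions lies at an endpoint or where they cross.
    """
    def worst(p):
        return min(p * M[0][j] + (1 - p) * M[1][j] for j in range(2))

    candidates = [Fraction(0), Fraction(1)]
    denom = M[0][0] - M[1][0] - M[0][1] + M[1][1]
    if denom != 0:
        p = Fraction(M[1][1] - M[1][0], denom)
        if 0 <= p <= 1:
            candidates.append(p)
    best = max(candidates, key=worst)
    return best, worst(best)


# Assumed payoffs (not taken from Eq. 1), indexed [own action][other's action].
R1 = [[4, 0], [0, 3]]
R2 = [[3, 0], [0, 4]]
mix1, safety1 = maximin_2x2(R1)
mix2, safety2 = maximin_2x2(R2)

# Expected one-step payoffs when each player plays her preferred action 4/7 of the time.
q1, q2 = Fraction(4, 7), Fraction(3, 7)  # P(a1) for players 1 and 2
eq1 = 4 * q1 * q2 + 3 * (1 - q1) * (1 - q2)
eq2 = 3 * q1 * q2 + 4 * (1 - q1) * (1 - q2)
```

Under the assumed payoffs both the worst-case guarantee and the randomized equilibrium give each player 12/7 per step, although the guarantee-maximizing mix and the equilibrium mix put different weights on the two actions.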

2.4.2  RUBINSTEIN’S  GAME 

Rubinstein  [11]  considered  a  game  where  two  players  divide  a  slice  of  pie.  The 
first player offers a division $x, 1 - x$ to the second; the second player either accepts
the division, or refuses and offers her own division $1 - y, y$. The game repeats
until some player accepts an offer or until either player gives up. In the latter
case neither player gets any pie. Rubinstein showed that if player $p$'s utility
for receiving a fraction $x$ at time $t$ is $U_p(x, t) = \gamma^t U_p(x)$ for a discount factor
$0 < \gamma < 1$ and an appropriate time-independent utility function $U_p(x) \ge 0$, then
rational players will agree on a division near the so-called Nash bargaining point.
This is the point which maximizes the product of the utilities that the players
gain by cooperating, $U_1(x) U_2(1 - x)$. As $\gamma \uparrow 1$, the equilibrium will approach
the Nash point. See Fig. 2 for an illustration. For three or more players, a
similar result holds where agents take turns proposing multi-way divisions of
the  pie  [12]. 
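
For intuition, consider the special case of linear utilities $U_p(x) = x$ with a common discount factor; this special case is an assumption of the example, not the general setting above. The proposer then offers the responder exactly the discounted value of being the next proposer, so the proposer's share $x$ satisfies $x = 1 - \gamma x$, giving $x = 1/(1 + \gamma)$. The sketch below (with the hypothetical name `proposer_share`) reaches the same value by backward induction over a long finite game.

```python
def proposer_share(gamma, rounds=200):
    """Proposer's equilibrium share with linear utilities.

    Backward induction: in the final round the proposer keeps everything; one
    round earlier the proposer must leave the responder gamma times what the
    responder would get as the next proposer.
    """
    x = 1.0
    for _ in range(rounds):
        x = 1.0 - gamma * x
    return x


share_08 = proposer_share(0.8)                     # discount factor used in Fig. 2
share_near_1 = proposer_share(0.999, rounds=20000)
```

As $\gamma$ approaches 1 the first-mover advantage vanishes and the split approaches the even division, which is the Nash point for these symmetric utilities.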

While  the  game  above  is  restricted  to  two  players,  there  is  also  a  general 
multi-player  version  of  the  bargaining  game.  The  multi-player  version  works  as 
follows.  Agents  take  turns  proposing  multi-way  divisions  of  a  pie.  After  each 
proposal,  all  agents  other  than  the  proposer  decide  independently  whether  to 
accept  or  reject.  If  all  agents  accept,  the  proposal  is  implemented.  Otherwise, 
any  agents  who  accepted  have  their  shares  fixed  at  the  proposed  level  and  are 
removed  from  further  play;  the  next  remaining  agent  then  proposes  a  division 
of  the  remaining  pie.  As  in  the  two-player  game,  the  unique  subgame-perfect 
equilibrium approaches the Nash point as $\gamma \uparrow 1$ [12].

2.5  NASH  BARGAINING  POINT 

In  a  multi-player  game,  the  Nash  bargaining  point  is  the  solution  maximizing 
the  product  of  the  excess  values  to  each  player  above  her  safety  value.  That  is, 

$$V^{\text{Nash}} = \arg\max_{V} \prod_{p=1}^{P} \left(V_p - V_p^{\text{safety}}\right)$$

This argmax is taken only over values of $V$ such that $V_p \ge V_p^{\text{safety}}$ for all
p.  The  Nash  bargaining  point  can  be  uniquely  characterized  as  meeting  some 
criteria  for  a  “good”  bargaining  solution,  such  as  symmetry  and  weak  Pareto 
optimality.  See  [13]  or  [14]  for  more  details. 
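
When only a finite list of candidate value vectors is available, the Nash bargaining point can be approximated by direct search. The following sketch assumes such a list; the function name and the sample numbers are hypothetical, and a search over the full convex set of feasible values would require an optimization routine instead.

```python
import math


def nash_point(candidates, safety):
    """Candidate maximizing the product of excess values over the safety values.

    Only candidates that give every player at least her safety value are
    considered; raises ValueError if there are none.
    """
    feasible = [
        v for v in candidates
        if all(vp >= sp for vp, sp in zip(v, safety))
    ]
    return max(
        feasible,
        key=lambda v: math.prod(vp - sp for vp, sp in zip(v, safety)),
    )


safety = (12 / 7, 12 / 7)
best = nash_point([(4, 3), (3.5, 3.5), (3, 4), (1, 6)], safety)
```

The candidate (1, 6) is excluded because it gives the first player less than her safety value, and the symmetric vector wins among the rest.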


3  NEGOTIATION  PROTOCOL 

The  Rubinstein  game  implicitly  assumes  that  the  result  of  a  failure  to  cooperate 
is  known  to  all  players:  nobody  gets  any  pie.  The  multi-player  version  of  the 
game  assumes  in  addition  that  giving  one  player  a  share  of  the  pie  doesn’t  force 
us  to  give  a  share  to  any  other  player.  Neither  of  these  properties  holds  for 
general-sum stochastic games. They are, however, easy to check, and often hold
or  can  be  made  to  hold  for  planning  domains  of  interest. 

So, we will assume that the players have agreed beforehand on a subgame-perfect
equilibrium policy $\pi^{\text{dis}}$, called the disagreement policy, that they will
follow in the event of a negotiation failure. In addition, for games with three or
more players, we will assume that each player can unilaterally reduce her own
utility by any desired amount without affecting other players' utilities.²

2  Our  results  for  the  multi-player  problem  also  hold  under  the  alternate  assumption  that 
utilities  are  transferable,  by  an  argument  due  to  Krishna  and  Serrano  [12].  We  prefer  our  stated 
assumption,  since  it  does  not  require  the  players’  utilities  to  be  expressed  in  compatible  units. 

Given these assumptions, our protocol proceeds in two phases. In the first
phase agents compute subgame-perfect equilibria and take turns revealing them.
On an agent's turn she either reveals an equilibrium or passes; if all agents pass
consecutively, the protocol proceeds to the second phase. When an agent states
a policy $\pi$, the other agents verify that $\pi$ is a subgame-perfect equilibrium and
calculate its payoff vector $V^{\pi}(s_{\text{start}})$; players who state non-equilibrium policies
miss their turn. (Such players are assigned to receive their disagreement utilities,
as described below.)

At the end of the first phase, suppose the players have revealed a set $\Pi$ of
policies. Define

$$X_p(\pi) = V_p^{\pi}(s_{\text{start}}) - V_p^{\text{dis}}(s_{\text{start}})$$

$$U = \operatorname{convhull}\,\{X(\pi) \mid \pi \in \Pi\}$$

$$\bar{U} = \{u \ge 0 \mid (\exists v \in U \mid u \le v)\}$$

where $V^{\text{dis}}$ is the value function of $\pi^{\text{dis}}$, $X_p(\pi)$ is the excess of policy $\pi$ for player
$p$, and $\bar{U}$ is the set of feasible excess vectors.

In the second phase, players take turns proposing points $u \in \bar{U}$ along with
policies or mixtures of policies in $\Pi$ that achieve them. After each proposal, all
agents  except  the  proposer  decide  whether  to  accept  or  reject.  If  everyone 
accepts,  the  proposal  is  implemented:  everyone  starts  executing  the  agreed 
equilibrium. 
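
Because a proposal names both a point $u$ and the mixture that achieves it, the other agents can check validity without computing a convex hull. The sketch below is one possible check under assumed data formats: `values[k]` is the start-state value vector of the k-th revealed policy, `weights` is the proposed mixture, and `v_dis` is the disagreement value vector. All names and numbers are hypothetical.

```python
def excess(value, v_dis):
    """Excess vector of a policy: its value minus the disagreement value."""
    return tuple(v - d for v, d in zip(value, v_dis))


def valid_proposal(u, weights, values, v_dis, tol=1e-9):
    """True if u is nonnegative and componentwise at most the excess of the
    proposed mixture, i.e. u is a feasible excess vector witnessed by it."""
    if any(w < -tol for w in weights) or abs(sum(weights) - 1.0) > tol:
        return False  # not a probability distribution over revealed policies
    mix = [
        sum(w * excess(val, v_dis)[p] for w, val in zip(weights, values))
        for p in range(len(u))
    ]
    return all(-tol <= up <= mp + tol for up, mp in zip(u, mix))


values = [(4.0, 3.0), (3.0, 4.0)]   # hypothetical revealed equilibria
v_dis = (1.0, 1.0)                  # hypothetical disagreement values
ok = valid_proposal((2.5, 2.0), (0.5, 0.5), values, v_dis)
too_greedy = valid_proposal((3.0, 3.0), (0.5, 0.5), values, v_dis)
```

The first proposal is valid because the even mixture has excess (2.5, 2.5) and the second player can lower her own utility to 2.0; the second proposal asks for more than the mixture delivers.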

Otherwise,  the  players  who  accepted  are  removed  from  future  negotiation 
and have their utilities fixed at the proposed levels. Fixing player $p$'s utility at
$u_p$ means that all future proposals must give $p$ exactly $u_p$. (Invalid proposals
result in the proposer losing her turn.) To achieve this, the proposal may require
$p$ to voluntarily lower her own utility; this requirement is enforced by the threat
that all players will revert to $\pi^{\text{dis}}$ if $p$ fails to act as required. The choose($i$)
in Figure 3 marks the place in the protocol where agent $i$ gets to choose one
of several alternatives: $i$ picks which of the lines inside the choose/end pair
will execute. The parameter $\epsilon$ is an arbitrary small positive number which
determines  whether  we  force  a  phase  to  end  early;  it  should  be  small  enough 
that  there  is  little  risk  of  the  protocol  ending  before  the  agents  want  it  to,  but 
large  enough  that  the  agents  feel  pressure  to  arrive  at  an  agreement  rather  than 
stalling  forever.  At  the  end  of  Phase  I,  the  set  pol  contains  the  policies  which 
the  agents  will  bargain  over  in  Phase  II. 

If at some point one of the remaining players declares that further negotiation
is pointless, or if we hit the $\epsilon$ chance of having the current round of
communication end, all remaining players are assigned their disagreement values.
The players execute the last proposed policy $\pi$ (or $\pi^{\text{dis}}$ if there has been
no valid proposal), and any player $p$ for whom $V_p^{\pi}(s_{\text{start}})$ is greater than her
assigned utility $u_p$ voluntarily lowers her utility to the correct level. (Again,
failure to do so results in all players reverting to $\pi^{\text{dis}}$.)

pol ← ∅
repeat
    done ← true
    for each agent i
        choose(i)
            i says "pass"
            i adds a policy or set of policies to pol; done ← false
        end choose
    end for
    With probability $\epsilon$, done ← true
until done


Figure 3: Phase I of the negotiation protocol.


Under the above protocol, players' preferences are the same as in a Rubinstein
game with utility set $\bar{U}$ (the set of feasible excess vectors): because we have assumed that negotiation ends
with probability $\epsilon$ after each message, agreeing on $u$ after $t$ additional steps is
exactly as good as agreeing on $u(1 - \epsilon)^t$ now. So with $\epsilon$ sufficiently small, the
Rubinstein or Krishna-Serrano results show that rational players will agree on
a vector $u \in \bar{U}$ which is close to the Nash point $\arg\max_{u \in \bar{U}} \prod_p u_p$. Because of
this  property  we  can  give  a  coarse  description  of  how  the  agents  should  play  in 
Phase  I:  they  should  nominate  policies  that  give  themselves  high  payoffs  to  try 
to  steer  the  outcome  in  their  favor.  But  they  will  also  want  to  nominate  policies 
that  give  high  payoffs  to  other  agents,  because  such  policies  are  more  likely  to 
eventually  be  incorporated  into  the  plan  accepted  by  the  group  as  a  whole. 

In  figures  3  and  4  we  give  an  algorithmic  description  of  the  specific  protocol 
followed  by  negotiating  players. 
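
The two phases can also be simulated in a few lines. The sketch below is a simplified reading of the figures, not a definitive implementation: each policy is represented only by its payoff vector, agents are modeled as callbacks (`agent`, `propose`, `respond` are hypothetical interfaces), and the rule that later proposals must leave already-accepted players at their fixed utilities is not enforced.

```python
import random


def phase_one(agents, eps, rng):
    """Phase I: agents take turns adding policies to the pool or passing.

    Each agent is a callable agent(pol) -> list of new policies ([] = "pass").
    """
    pol = []
    while True:
        done = True
        for agent in agents:
            new = agent(pol)
            if new:                       # the agent nominated something
                pol.extend(new)
                done = False
        if rng.random() < eps:            # forced early end of the phase
            done = True
        if done:
            return pol


def phase_two(pol, propose, respond, d, eps, rng):
    """Phase II: remaining agents take turns proposing mixtures over the pool.

    propose(i, pol) returns mixture weights over pol; respond(j, u) returns
    True to accept utility u; d holds the disagreement utilities.
    """
    n = len(d)
    utility = list(d)
    accepted = [False] * n
    while True:
        for i in range(n):
            if accepted[i]:
                continue
            s = propose(i, pol)
            value = [sum(w * v[j] for w, v in zip(s, pol)) for j in range(n)]
            done = True
            for j in range(n):
                if j == i or accepted[j]:
                    continue
                if respond(j, value[j]):
                    utility[j] = value[j]
                    accepted[j] = True
                else:
                    done = False
            if done:                      # everyone still in play accepted
                utility[i] = value[i]
                return utility
        if rng.random() < eps:            # negotiation breaks down
            return utility


rng = random.Random(0)
first = lambda pol: [] if (4.0, 3.0) in pol else [(4.0, 3.0)]
second = lambda pol: [] if (3.0, 4.0) in pol else [(3.0, 4.0)]
pool = phase_one([first, second], eps=0.01, rng=rng)
result = phase_two(pool, propose=lambda i, pol: [0.5, 0.5],
                   respond=lambda j, u: u >= 3.4, d=(0.0, 0.0),
                   eps=0.01, rng=rng)
```

In this toy run each agent nominates the policy she prefers, the first proposer offers an even mixture, and the second agent accepts it immediately.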


4  COMPUTING  EQUILIBRIA 

In  order  to  use  the  protocol  of  Sec.  3  for  bargaining  in  a  stochastic  game,  the 
players  must  be  able  to  compute  some  subgame-perfect  equilibria.  Computing 
equilibria  is  a  hard  problem  [15],  so  we  cannot  expect  real  agents  to  find  the 
entire  set  of  equilibria.  Fortunately,  each  player  will  want  to  find  the  equilibria 
which  are  most  advantageous  to  herself  to  influence  the  negotiation  process  in 
her  favor.  But  equilibria  which  offer  other  players  reasonably  high  reward  have 
a  higher  chance  of  being  accepted  in  negotiation.  So,  self  interest  will  naturally 
distribute  the  computational  burden  among  all  the  players. 

In this section we describe an efficient dynamic-programming algorithm for
computing equilibria. The algorithm takes some low-payoff equilibria as input
and (usually) outputs higher-payoff equilibria. It is based on the intuition that
we can use low-payoff equilibria as enforcement tools: by threatening to switch
to an equilibrium that has low value to player $p$, we can deter $p$ from deviating
from a cooperative policy.


for each agent i
    utility[i] ← $d_i$
    accepted[i] ← false
end for
repeat
    for each agent i
        if accepted[i] then continue
        i proposes a distribution s over complete policies from pol
        done ← true
        for each agent j ≠ i
            if accepted[j] then continue
            u ← utility of s to j
            choose(j)
                j says "accept"; utility[j] ← u; accepted[j] ← true
                j says "reject"; done ← false
            end choose
        end for
        if done then
            utility[i] ← utility of s to i
            return
        end if
    end for
    With probability $\epsilon$, done ← true
until done


Figure 4: Phase II of the negotiation protocol.


Initialization
for $s \in S$
    $\mathcal{V}(s) \leftarrow \{V \mid V_p^{\text{dis}}(s) \le V_p \le R_{\max}/(1 - \gamma)\}$
end

Repeat until converged
for iteration ← 1, 2, ...
    for $s \in S$
        Compute value vector set for each joint action,
        then throw away unenforceable vectors
        for $a \in A$
            $\mathcal{Q}(s, a) \leftarrow \{R(s, a)\} + \gamma \sum_{s' \in S} T(s, a)(s')\, \mathcal{V}(s')$
            $\mathcal{Q}(s, a) \leftarrow \{Q \in \mathcal{Q}(s, a) \mid Q \ge V^{\text{dev}}(s, a)\}$
        end
        We can now randomize among joint actions
        $\mathcal{V}(s) \leftarrow \operatorname{convhull} \bigcup_a \mathcal{Q}(s, a)$
    end
end


Figure 5: Dynamic programming using exact operations on sets of value vectors

In more detail, we will assume that we are given $P$ different equilibria
$\pi_1^{\text{pun}}, \ldots, \pi_P^{\text{pun}}$; we will use $\pi_p^{\text{pun}}$ to punish player $p$ if she deviates. We can
set $\pi_p^{\text{pun}} = \pi^{\text{dis}}$ for all $p$ if $\pi^{\text{dis}}$ is the only equilibrium we know; or, we can
use any other equilibrium policies that we happen to have discovered. The algorithm
will be most effective when the value of $\pi_p^{\text{pun}}$ to player $p$ is as low as
possible in all states.

We will then search for cooperative policies that we can enforce with the
given threats $\pi_p^{\text{pun}}$. We will first present an algorithm which pretends that we
can  efficiently  take  direct  sums  and  convex  hulls  of  arbitrary  sets.  This  algorithm 
is  impractical,  but  finds  all  enforceable  value  vectors.  We  will  then  turn  it  into 
an  approximate  algorithm  which  uses  finite  data  structures  to  represent  the 
set-valued  variables.  As  we  allow  more  and  more  storage  for  each  set,  the 
approximate  algorithm  will  approach  the  exact  one;  and  in  any  case  the  result 
will  be  a  set  of  equilibria  which  the  agents  can  execute. 

4.1  THE  EXACT  ALGORITHM 

Our algorithm maintains a set of value vectors $\mathcal{V}(s)$ for each state $s$. It initializes
$\mathcal{V}(s)$ to a set which we know contains the value vectors for all equilibrium
policies. It then refines $\mathcal{V}$ by dynamic programming: it repeatedly attempts to
improve the set of values at each state by backing up all of the joint actions,
excluding joint actions from which some agent has an incentive to deviate.

In more detail, we will compute $V_p^{\text{dis}}(s) = V_p^{\pi^{\text{dis}}}(s)$ for all $s$ and $p$ and use
the vector $V^{\text{dis}}(s)$ in our initialization. (Recall that we have defined $V_p^{\pi}(s)$
for a nonstationary policy $\pi$ as the value of $\pi$ if $s$ were the start state.) We
also need the values of the punishment policies for their corresponding players,
$V_p^{\text{pun}}(s) = V_p^{\pi_p^{\text{pun}}}(s)$ for all $p$ and $s$. Given these values, define

$$Q_p^{\text{dev}}(s, a) = R_p(s, a) + \gamma \sum_{s' \in S} T(s, a)(s')\, V_p^{\text{pun}}(s') \tag{2}$$

to be the value to player $p$ of playing joint action $a$ from state $s$ and then
following $\pi_p^{\text{pun}}$ forever after.

From the above $Q_p^{\text{dev}}$ values we can compute player $p$'s value for deviating
from an equilibrium which recommends action $a$ in state $s$: it is $Q_p^{\text{dev}}(s, a')$ for
the best possible deviation $a'$, since $p$ will get the one-step payoff for $a'$ but be
punished by the rest of the players starting on the following time step. That is,

$$V_p^{\text{dev}}(s, a) = \max_{a'_p \in A_p} Q_p^{\text{dev}}(s, a_1 \times \ldots \times a'_p \times \ldots \times a_P) \tag{3}$$

$V_p^{\text{dev}}(s, a)$ is the value we must achieve for player $p$ in state $s$ if we are planning
to recommend action $a$ and punish deviations with $\pi_p^{\text{pun}}$: if we do not achieve
this  value,  player  p  would  rather  deviate  and  be  punished. 
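
Equations (2) and (3) translate directly into code. The sketch below uses dictionary-based tables whose layout is an assumption of this example: `R[s][a]` is a reward tuple, `T[s][a]` maps next states to probabilities, and `v_pun[p][s]` is player p's value under her punishment policy (players are indexed from 0). The demonstration game is a hypothetical repeated prisoner's dilemma in which "always defect" serves as the punishment equilibrium.

```python
def q_dev(p, s, a, R, T, v_pun, gamma):
    """Eq. (2): p's value for joint action a in state s, then punishment forever."""
    return R[s][a][p] + gamma * sum(
        prob * v_pun[p][s2] for s2, prob in T[s][a].items()
    )


def v_dev(p, s, a, actions, R, T, v_pun, gamma):
    """Eq. (3): p's best unilateral deviation from recommended joint action a."""
    return max(
        q_dev(p, s, a[:p] + (ap,) + a[p + 1:], R, T, v_pun, gamma)
        for ap in actions[p]
    )


def enforceable(q, s, a, actions, R, T, v_pun, gamma):
    """True if value vector q gives every player at least her deviation value."""
    return all(
        q[p] >= v_dev(p, s, a, actions, R, T, v_pun, gamma)
        for p in range(len(actions))
    )


# Hypothetical one-state game: C = cooperate, D = defect.
gamma = 0.9
actions = [("C", "D"), ("C", "D")]
payoff = {("C", "C"): (3.0, 3.0), ("C", "D"): (0.0, 5.0),
          ("D", "C"): (5.0, 0.0), ("D", "D"): (1.0, 1.0)}
R = {"s": payoff}
T = {"s": {a: {"s": 1.0} for a in payoff}}
v_pun = [{"s": 1.0 / (1 - gamma)}, {"s": 1.0 / (1 - gamma)}]  # defect forever

dev_value = v_dev(0, "s", ("C", "C"), actions, R, T, v_pun, gamma)
```

Here the best deviation from mutual cooperation is worth about 5 + 0.9 * 10 = 14 to either player, so recommending cooperation is enforceable whenever the continuation plan is worth at least that much to both.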

Our algorithm is shown in Fig. 5. After $k$ iterations, each vector in $\mathcal{V}(s)$
corresponds to a $k$-step policy in which no agent ever has an incentive to deviate.
In the $(k+1)$st iteration, the first assignment to $\mathcal{Q}(s, a)$ computes the value
of performing action $a$ followed by any $k$-step policy. The second assignment
throws out the pairs $(a, \pi)$ for which some agent would want to deviate from
$a$ given that the agents plan to follow $\pi$ in the future. And the convex hull
accounts for the fact that, on reaching state $s$, we can select an action $a$ and
future policy $\pi$ at random from the feasible pairs.³ Proofs of convergence and
correctness  of  the  exact  algorithm  are  in  the  appendix. 

5  APPROXIMATE  ALGORITHM

The exact algorithm performs operations on convex sets of value vectors. Actually
storing these sets exactly may require a prohibitive amount of space, and
thus a prohibitive amount of computation to perform operations on these sets.
So our approximate algorithm, rather than storing $\mathcal{V}(s)$ explicitly, chooses a
finite set of witness vectors $W \subset \mathbb{R}^P$ and stores $V(s, w) = \arg\max_{v \in \mathcal{V}(s)} (v \cdot w)$
for each $w \in W$. $\mathcal{V}(s)$ is then approximated by the convex hull of $\{V(s, w) \mid w \in W\}$.
The approximate algorithm is shown in Fig. 6. With a small $|W|$ it
is inaccurate but conservative: because the convex hull of $V(s, w)$ taken over
$w$ is smaller than $\mathcal{V}(s)$, we can only discard vectors and not add them, so all
of the returned value vectors will still correspond to vectors that the exact algorithm
would have returned.

³ It is important for this randomization to occur after reaching state $s$ to avoid introducing
incentives to deviate, and it is also important for the randomization to be public.


Initialization
for $s \in S$, $w \in W$
    $V(s, w) \leftarrow w\, R_{\max}/(1 - \gamma)$
end

Repeat until converged
for iteration ← 1, 2, ...
    for $w \in W$, $s \in S$
        Approximate value vector set for each joint action,
        then throw away unenforceable vectors
        for $a \in A$
            $Q(s, a) \leftarrow R(s, a) + \gamma \sum_{s' \in S} T(s, a)(s')\, V(s', w)$
            if $Q(s, a) \not\ge V^{\text{dev}}(s, a)$
                $Q(s, a) \leftarrow V^{\text{dis}}(s)$
            end
        end
        Approximate the convex hull
        $V(s, w) \leftarrow \arg\max_{q \in \{Q(s, a) \mid a \in A\}} q \cdot w$
    end
end


Figure 6: Dynamic programming using approximate operations on arrays of
value vectors


However, if $W$ samples the $P$-dimensional unit
hypersphere  densely  enough,  the  maximum  possible  approximation  error  will 
be  small.  (In  practice,  each  agent  will  probably  want  to  pick  W  differently,  to 
focus  her  computation  on  policies  in  the  portion  of  the  Pareto  frontier  where 
her  own  utility  is  relatively  high.)  As  |W|  increases,  the  error  introduced  at 
each  step  will  go  to  zero. 
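The loop structure of Fig. 6 can be sketched in a few lines of Python. This is a minimal illustration rather than our implementation: the function name, the dictionary-based encoding of rewards `R`, transitions `T`, deviation values `v_dev` and disagreement values `v_dis`, the synchronous update, and the fixed iteration count (in place of a convergence test) are all assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def approx_value_iteration(states, actions, witnesses, R, T, v_dev, v_dis,
                           gamma, r_max, iters):
    """Witness-based approximate backup (hypothetical encoding).

    R[s, a]     -> tuple of one-step rewards, one entry per player
    T[s, a]     -> dict mapping next state to probability
    v_dev[s, a] -> tuple of deviation values, one entry per player
    v_dis[s]    -> tuple of disagreement values at state s
    """
    n_players = len(witnesses[0])
    # Initialization: V(s, w) <- w * R_max / (1 - gamma)
    V = {(s, i): tuple(x * r_max / (1 - gamma) for x in w)
         for s in states for i, w in enumerate(witnesses)}
    for _ in range(iters):
        new_V = {}
        for i, w in enumerate(witnesses):
            for s in states:
                candidates = []
                for a in actions:
                    # Back up the stored vector for this witness.
                    q = tuple(
                        R[s, a][p] + gamma * sum(prob * V[s2, i][p]
                                                 for s2, prob in T[s, a].items())
                        for p in range(n_players))
                    # Unenforceable: some player prefers to deviate.
                    if any(q[p] < v_dev[s, a][p] for p in range(n_players)):
                        q = v_dis[s]
                    candidates.append(q)
                # Approximate the convex hull by its extreme point along w.
                new_V[s, i] = max(candidates, key=lambda q: dot(q, w))
        V = new_V
    return V
```

On a one-state game in which action `x` pays player 1 and action `y` pays player 2, two axis-aligned witnesses recover the two selfish extremes of the frontier and nothing in between, which mirrors the conservative behavior described above for small $|W|$.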


6  EXPERIMENTS 

We tested our value iteration algorithm and negotiation procedure on two robotic
planning domains: a joint motion planning problem and a supply-chain
management problem.

In our motion planning problem (Fig. 7), two players together control a
two-wheeled robot, with each player picking the rotational velocity for one wheel.
Each player has a list of goal landmarks which she wants to cycle through, but
the two players can have different lists of goals. We discretized states based on
$X$, $Y$, $\theta$ and the current goals, and discretized actions into stop,
slow (0.45 m/s), and fast (0.9 m/s), for 9 joint actions and about 25,000 states.
We discretized time at $\Delta t = 1$ s, and set $\gamma = 0.99$.

Figure 7: Execution traces for our motion planning example. Left and Center:
with 2 witness vectors per state, the agents randomize between two selfish paths.
Right: with 4-32 witnesses per state, the agents find a cooperative path. Steps
where either player achieved a goal are marked with ×.

For both the disagreement policy and all punishment policies, we used “always
stop,” since by keeping her wheel stopped either player can prevent the
robot  from  moving.  Planning  took  a  few  hours  of  wall  clock  time  on  a  desktop 
workstation  for  32  witnesses  per  state. 

Based  on  the  planner’s  output,  we  ran  our  negotiation  protocol  to  select  an 
equilibrium.  Fig.  7  shows  the  results:  with  limited  computation  the  players 
pick  two  selfish  paths  and  randomize  equally  between  them,  while  with  more 
computation  they  find  the  cooperative  path. 

Usually,  in  the  first  phase  of  the  negotiation  protocol,  we  simply  had  each 
agent  reveal  all  the  policies  she  knew  about;  this  strategy  is  optimal  if  both 
agents  know  the  same  set  of  equilibria,  as  they  do  here.  In  the  second  phase, 
the  optimal  strategy  is  for  the  first  player  to  immediately  propose  the  Nash 
point and for the second to accept.⁴
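To make the selection step concrete, the sketch below computes a Nash point for two players from a finite list of value vectors. It assumes that the Nash point is the Nash bargaining point, i.e. the point of the convex hull that maximizes the product of the two players' gains over the disagreement value; the function name and the restriction to two players are illustrative assumptions. Because the Pareto frontier of a planar convex hull consists of segments between pairs of listed points, it is enough to examine each point and the best mixture of each pair.

```python
from itertools import combinations


def nash_point(points, disagreement):
    """Two-player Nash bargaining point over the convex hull of `points`.

    Returns None if no candidate gives both players at least their
    disagreement value.
    """
    d1, d2 = disagreement
    best, best_prod = None, -1.0

    def consider(x):
        nonlocal best, best_prod
        g1, g2 = x[0] - d1, x[1] - d2
        if g1 >= 0 and g2 >= 0 and g1 * g2 > best_prod:
            best, best_prod = x, g1 * g2

    for x in points:
        consider(tuple(x))
    for u, v in combinations(points, 2):
        delta1, delta2 = u[0] - v[0], u[1] - v[1]
        a = delta1 * delta2
        if a < 0:
            # The product of gains along v + lam * (u - v) is a concave
            # quadratic in lam; clip its stationary point to [0, 1].
            b = (v[0] - d1) * delta2 + (v[1] - d2) * delta1
            lam = min(1.0, max(0.0, -b / (2 * a)))
            consider((v[0] + lam * delta1, v[1] + lam * delta2))
    return best
```

With only two selfish value vectors available, such as `(2.0, 0.0)` and `(0.0, 2.0)` with disagreement value `(0.0, 0.0)`, the maximizer is their equal mixture, which matches the equal randomization between selfish paths seen with limited computation.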

We also ran experiments in the same domain, but limiting the computation
of one agent and determining how that would affect the outcome of negotiation.
Fig. 8 shows the results of negotiation between two players using different
amounts of computation. Because the more restricted agent doesn’t know about
some of the best equilibria, the less restricted agent can influence negotiation
by revealing only some of the equilibria that she knows about, and can alter the
outcome significantly in her favor. But revealing too few equilibria leads to an
outcome that is worse for both agents.

⁴ For the purpose of this experiment, we take $\epsilon$ to be so small as to make the
difference between the equilibrium and the Nash point negligible. It is an interesting
subject for future work to determine how large $\epsilon$ needs to be to give real agents
the necessary incentive to come to an agreement expeditiously.

Figure 8: Negotiation between agents with different computational abilities.
Solid line: Pareto frontier computed by a 32-witness agent; dash-dot line:
4-witness agent’s frontier; × marks: Nash points of sets formed from the 4-witness
agent’s frontier and some of the 32-witness agent’s frontier; ⊗ mark: Nash point
of full set.


For our second experiment we examined a more realistic supply-chain problem.
Here each player is a parts supplier competing for the business of an engine
manufacturer.  The  manufacturer  doesn’t  store  items  and  will  only  pay  for  parts 
which  can  be  used  immediately.  Each  player  controls  a  truck  which  moves  parts 
from  warehouses  to  the  assembly  shop;  she  pays  for  parts  when  she  picks  them 
up,  and  receives  payment  on  delivery.  Each  player  gets  parts  from  different 
locations  at  different  prices  and  neither  player  can  individually  provide  all  of 
the  parts  the  manufacturer  needs. 

Each  player’s  truck  can  be  at  six  locations  along  a  line:  four  warehouse 
locations  (each  of  which  provides  a  different  type  of  part),  one  empty  location, 
and  the  assembly  shop.  Building  an  engine  requires  five  parts,  delivered  in 
the  order  A,  {B,  C},  D,  E  (parts  B  and  C  can  arrive  in  either  order).  After 
E,  the  manufacturer  needs  A  again.  Players  can  move  left  or  right  along  the 
line  at  a  small  cost,  or  wait  for  free.  They  can  also  buy  parts  at  a  warehouse 
(dropping  any  previous  cargo),  or  sell  their  cargo  if  they  are  at  the  shop  and  the 
manufacturer  wants  it.  Each  player  can  only  carry  one  part  at  a  time  and  only 
one  player  can  make  a  delivery  at  a  time.  Finally,  any  player  can  retire  and  sell 
her  truck;  in  this  case  the  game  ends  and  all  players  get  the  value  of  their  truck 
plus  any  cargo.  The  disagreement  policy  is  for  all  players  to  retire  at  all  states. 
Fig. 9 shows the computed sets $V(s_{start})$ for various numbers of witnesses. The
more  witnesses  we  use,  the  more  accurately  we  represent  the  frontier,  and  the 
closer  our  final  policy  is  to  the  true  Nash  point. 
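The delivery-order constraint described above is a small state machine. The sketch below is one hypothetical way to encode it; the names `ENGINE_STAGES` and `wanted_parts`, and the choice to track the set of parts already delivered for the current engine, are assumptions made for illustration.

```python
# Delivery stages for one engine: A, then B and C in either order, then D, then E.
ENGINE_STAGES = [{"A"}, {"B", "C"}, {"D"}, {"E"}]


def wanted_parts(delivered):
    """Parts the manufacturer will pay for next.

    `delivered` is the set of parts already delivered for the current engine.
    """
    for stage in ENGINE_STAGES:
        missing = stage - delivered
        if missing:
            return missing
    # All five parts delivered: the manufacturer needs A again.
    return set(ENGINE_STAGES[0])
```

For example, after A and C have been delivered the only acceptable part is B, so a player holding D must wait or drop her cargo.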

All of the policies computed are “intelligent” and “cooperative”: a human
observer would not see obvious ways to improve them, and in fact would say
that they look similar despite their differing payoffs. Players coordinate their
motions, so that one player will drive out to buy part E while the other delivers
part D. They sit idle only in order to delay the purchase of a part which would
otherwise be delivered too soon.

Figure 9: Supply chain management problem. In the left figure, Player 1 is about
to deliver part D to the shop, while player 2 is at the warehouse which sells B.
The right figure shows the tradeoff between accuracy and computation time.
The solid curve is the Pareto frontier for $s_{start}$, as computed using 8 witnesses
per state. The dashed and dotted lines were computed using 2 and 4 witnesses,
respectively. Dots indicate computed value vectors; × marks indicate the Nash
points.


7  CONCLUSION 

Real-world planning problems involve negotiation among multiple agents with
varying goals. To take all agents’ incentives into account, the agents should
find and agree on Pareto-dominant subgame-perfect Nash equilibria. For this
purpose, we presented efficient planning and negotiation algorithms for
general-sum stochastic games, and tested them on two robotic planning problems.

A Proof of Convergence of Value Iteration

In Figure 5 we presented a dynamic programming algorithm for computing the value
vectors achievable in equilibrium in a stochastic game. In this section we provide a
proof of the correctness of this algorithm. Specifically, we show that the algorithm
will converge and, after convergence, will return the set of discounted value vectors
achievable in subgame-perfect equilibrium using the given set of punishment policies.
We’ll start by analyzing a simplified version of our algorithm, which omits the
pruning step. (This version computes all achievable value vectors, without regard to
whether they are achievable in equilibrium.) Then we will generalize the proof to
apply to the full version of our algorithm.

Both versions of our algorithm can be seen as repeatedly applying a value-iteration
backup operator ($T$ or $T_{prune}$, defined below) to an initial conservative estimate
of the achievable values. In contrast to the version of value iteration for discounted
MDPs, our operators are not contractions in any standard norm. Instead, our proofs
rely on a monotonicity property, described in more detail below.


A.1 Definitions


We will start with some definitions that will be useful in our proof. As above, there
are $N$ states and $P$ players, $a \in A$ is a joint action in the full set of joint
actions $A$, and $R(s, a)$ is the one-step reward vector for state $s$ and joint action
$a$. $\overline{R}_a$ is an $N$-vector of $P$-vectors, telling the rewards in each state
to each player of following joint action $a$. $P_a \in \mathbb{R}^{N \times N}$ is a
transition matrix corresponding to joint action $a$. So $\forall i, j.\ P_{a,ij} \geq 0$
and $\forall i.\ \sum_j P_{a,ij} = 1$.

Write $R_M$ for the largest absolute value of any one-step reward to any player in the
game. That is, $R_M = \max_{a,s,p} |R_p(s, a)|$. Given $R_M$, $V_M = \frac{R_M}{1-\gamma}$
is an upper bound on the absolute expected discounted value that any player, following
any policy, can hope to achieve.
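The bound is the usual geometric-series argument, written out below as a short check. Here $r_t$ denotes the reward one player receives at step $t$ (a symbol introduced only for this illustration); the numerical line uses the discount $\gamma = 0.99$ from the motion planning experiment and assumes rewards normalized so that $R_M = 1$.

```latex
\begin{align*}
\Bigl|\, \mathbb{E}\sum_{t=0}^{\infty} \gamma^t r_t \Bigr|
  &\le \sum_{t=0}^{\infty} \gamma^t R_M
   = \frac{R_M}{1-\gamma} = V_M, \\
\gamma = 0.99,\; R_M = 1
  &\;\Longrightarrow\; V_M = \frac{1}{0.01} = 100 .
\end{align*}
```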

Write $\overline{\mathbf{V}}$ for the vector of sets of discounted value vectors
achievable at all states in the game. That is, $\overline{\mathbf{V}}$ is a vector of
length $N$; each component $\mathbf{V}(s)$ is a subset of $\mathbb{R}^P$ which
represents a set of value vectors achievable in the game starting from state $s$.
We introduce the overline notation to make it clear when we are referring to a vector
over game states. We use the boldface notation to indicate that the structure is a
set or vector of sets. $\overline{\mathbf{V}}$ is an $N$-vector of sets of $P$-vectors,
where each element of a $P$-vector is a discounted value achievable to a player. We
only introduce this complex structure (vector of sets of vectors) because it precisely
captures what we want to know about the game. We need to capture all possible
discounted values that can be achieved, under any policy, by all $P$ players starting
from state $s$: this is precisely a set of $P$-vectors. Since the game can start from
any of the $N$ states, we need $N$ of these sets of $P$-vectors.


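To measure the size of a vector of sets of vectors, we use a generalization of the
infinity norm:

$$\|\overline{\mathbf{V}}\|_\infty = \max_i \|\mathbf{V}_i\|_\infty$$

That is, applying the infinity norm to a vector of sets $\overline{\mathbf{V}}$ returns
the max of the infinity norm applied to each set in $\overline{\mathbf{V}}$.

$\mathbf{I} \subset \mathbb{R}^P$ is the hypercube centered at the origin with sides of
length 2 in each dimension. That is,
$\mathbf{I} = \{V \in \mathbb{R}^P \mid \|V\|_\infty \leq 1\}$. $\overline{\mathbf{I}}$
is a vector of $N$ copies of $\mathbf{I}$. So, $\|\overline{\mathbf{I}}\|_\infty = 1$.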

$A + B$, where $A, B \subset \mathbb{R}^P$, is the cross-sum operator:
$A + B = \bigcup_{a \in A,\, b \in B} \{a + b\}$. $CH_a(G(a))$ is the convex closure
operator: if $G(a)$ is a set of vectors for each value of the dummy variable $a$, then
$CH_a(G(a))$ is the set of all convex combinations of points in $\bigcup_a G(a)$.
$CH_a(\overline{\mathbf{G}}(a))$ is the componentwise generalization of $CH$ to vectors
of sets.
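For finite sets these operators are easy to write down. The sketch below is illustrative only: it represents each set as a Python set of tuples, and it assumes that the infinity norm of a set is the largest infinity norm of any vector it contains (the convention implied by the definition of the hypercube). The function names are hypothetical.

```python
from itertools import product


def cross_sum(A, B):
    """Cross-sum of two finite sets of equal-length vectors."""
    return {tuple(x + y for x, y in zip(a, b)) for a, b in product(A, B)}


def set_inf_norm(A):
    """Largest absolute coordinate over all vectors in a nonempty finite set."""
    return max(abs(x) for a in A for x in a)


def vec_inf_norm(V):
    """Infinity norm of a vector (list) of nonempty finite sets."""
    return max(set_inf_norm(Vs) for Vs in V)
```

For instance, the cross-sum of `{(0, 0), (1, 0)}` and `{(0, 1), (2, 2)}` contains the four pairwise sums of their elements.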


We can now define the simplified transition operator, which is the same as one
iteration of the exact algorithm in Figure 5 except that it omits the pruning step.

Definition

$$T(\overline{\mathbf{V}}) = CH_{a \in A}(\overline{R}_a + \gamma P_a \overline{\mathbf{V}})$$

In this expression, applying a function $P_a$ to a vector of sets is defined as one
might expect: we use the usual expression for a matrix multiplication,

$$(P_a \overline{\mathbf{V}})(s) = \sum_{s'} (P_a)_{s,s'} \mathbf{V}(s')$$

but with cross-sum and scalar multiplication of sets rather than the usual real sum
and product operations. This generalizes the usage of the transition matrix in the
standard Bellman backup equations.
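The set-valued matrix product can be illustrated on finite sets. This is a sketch under the same assumptions as before (sets of tuples, hypothetical function names), not a representation suited to the convex sets used by the exact algorithm.

```python
from itertools import product


def scale(c, A):
    """Multiply every vector in the finite set A by the scalar c."""
    return {tuple(c * x for x in a) for a in A}


def cross_sum(A, B):
    """Cross-sum of two finite sets of equal-length vectors."""
    return {tuple(x + y for x, y in zip(a, b)) for a, b in product(A, B)}


def apply_transition(P, V):
    """Row s of the result is the cross-sum over s' of P[s][s'] * V[s']."""
    result = []
    for row in P:
        acc = None
        for prob, Vs in zip(row, V):
            term = scale(prob, Vs)
            acc = term if acc is None else cross_sum(acc, term)
        result.append(acc)
    return result
```

With `P = [[0.5, 0.5], [0, 1]]` and `V = [{(0,), (2,)}, {(4,)}]`, the first row mixes the two states' sets and the second row simply copies the second state's set.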


A.2 Convergence of the simplified algorithm

Having completed the definitions, the goal of the first set of proofs is to show that, if
$\overline{\mathbf{V}}$ is initialized to a superset of the hypercube
$V_M \times \overline{\mathbf{I}}$, and the operation
$\overline{\mathbf{V}} \leftarrow T\overline{\mathbf{V}}$ is repeatedly applied,
$\overline{\mathbf{V}}$ will converge.


Lemma 1 For any $\overline{\mathbf{V}}$,

$$\|P_a \overline{\mathbf{V}}\|_\infty \leq \|\overline{\mathbf{V}}\|_\infty$$

Proof: Define $V_m = \|\overline{\mathbf{V}}\|_\infty$. Then

$$\|P_a \overline{\mathbf{V}}\|_\infty
  = \max_i \|(P_a \overline{\mathbf{V}})_i\|_\infty
  = \max_i \Bigl\|\sum_j P_{a,ij} \mathbf{V}_j\Bigr\|_\infty
  \leq \max_i \Bigl\|\sum_j P_{a,ij} V_m\Bigr\|_\infty
  = V_m \max_i \sum_j P_{a,ij}
  = V_m$$

The first equality applies the definition of the infinity norm and the second applies
the definition of matrix multiplication. The inequality holds because all values
$P_{a,ij} \geq 0$, so replacing the sets $\mathbf{V}_j$ with something at least as large
makes the result no smaller. The last two equalities hold because $V_m$, a scalar,
factors out, and by construction the sum of each row of any transition matrix $P_a$
is 1. □


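Lemma 2

$$\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}' \Rightarrow
  P_a \overline{\mathbf{V}} \subseteq P_a \overline{\mathbf{V}}'$$

Proof: First note that

$$\overline{\mathbf{V}}' = \overline{\mathbf{V}}' \cup \overline{\mathbf{V}}$$

Using this fact,

$$(P_a \overline{\mathbf{V}})_i = \sum_j P_{a,ij} \mathbf{V}_j
  \subseteq \sum_j P_{a,ij} \mathbf{V}'_j = (P_a \overline{\mathbf{V}}')_i$$

The first equality applies the definition of matrix multiplication. The containment uses
the fact that a scalar times a subset of a set is a subset of that scalar times the set
itself. The last equality again applies the definition of matrix multiplication. □

Lemma 3 $\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}' \Rightarrow
T(\overline{\mathbf{V}}) \subseteq T(\overline{\mathbf{V}}')$

Proof: Again note that

$$\overline{\mathbf{V}}' = \overline{\mathbf{V}} \cup \overline{\mathbf{V}}' \qquad (4)$$

Therefore

$$T\overline{\mathbf{V}}
  = CH_{a \in A}(\overline{R}_a + \gamma P_a(\overline{\mathbf{V}}))
  \subseteq CH_{a \in A}(\overline{R}_a + \gamma P_a(\overline{\mathbf{V}} \cup \overline{\mathbf{V}}'))
  = T(\overline{\mathbf{V}} \cup \overline{\mathbf{V}}')$$

The first equality is the definition of $T$. The next relation follows from (4) and
Lemma 2. The last equality again follows by the definition of $T$, and by (4) the
right-hand side equals $T(\overline{\mathbf{V}}')$. □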

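Lemma 3 says that $T$ is monotone. So, if we start with some vector
$\overline{\mathbf{V}}$ and happen to find that
$T\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}$, we can see by applying $T$ to
both sides of the relation that $T^2\overline{\mathbf{V}} \subseteq T\overline{\mathbf{V}}$,
and in general $T^k\overline{\mathbf{V}} \subseteq T^{k-1}\overline{\mathbf{V}}$ for
$k > 0$. That is, each application of the operator $T$ gives a result that is contained
in the previous iteration’s $\overline{\mathbf{V}}$.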
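The induction behind this chain can be written out explicitly. The following is a short derivation sketch that uses only Lemma 3 and the assumed starting relation $T\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}$:

```latex
\begin{align*}
T\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}
  &\;\Longrightarrow\; T^{2}\overline{\mathbf{V}} \subseteq T\overline{\mathbf{V}}
  && \text{(apply Lemma 3 once)} \\
T^{k}\overline{\mathbf{V}} \subseteq T^{k-1}\overline{\mathbf{V}}
  &\;\Longrightarrow\; T^{k+1}\overline{\mathbf{V}} \subseteq T^{k}\overline{\mathbf{V}}
  && \text{(inductive step, Lemma 3)} \\
\overline{\mathbf{V}} \supseteq T\overline{\mathbf{V}}
  &\supseteq T^{2}\overline{\mathbf{V}} \supseteq \cdots
  && \text{(nested sequence of iterates)}
\end{align*}
```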

To take advantage of this fact, the next few lemmas describe an initialization that
will guarantee $T\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}$ for the first backup.


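Lemma 4

$$P_a(k \times \overline{\mathbf{I}}) \subseteq k \times \overline{\mathbf{I}}$$

for any positive scalar $k$.

Proof: This follows directly from Lemma 1. By contradiction, if
$P_a(k \times \overline{\mathbf{I}})$ is not a subset of $k \times \overline{\mathbf{I}}$,
then $\|P_a(k \times \overline{\mathbf{I}})\|_\infty > \|k \times \overline{\mathbf{I}}\|_\infty$,
a violation of Lemma 1. □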


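Lemma 5

$$\|T\overline{\mathbf{V}}\|_\infty \leq R_M + \gamma \|\overline{\mathbf{V}}\|_\infty$$

Proof:

$$\|T\overline{\mathbf{V}}\|_\infty
  = \|CH_{a \in A}(\overline{R}_a + \gamma P_a \overline{\mathbf{V}})\|_\infty
  = \max_{a \in A} \|\overline{R}_a + \gamma P_a \overline{\mathbf{V}}\|_\infty
  \leq \max_{a \in A} \|\overline{R}_a\|_\infty + \gamma \max_{a \in A} \|P_a \overline{\mathbf{V}}\|_\infty
  \leq R_M + \gamma \|\overline{\mathbf{V}}\|_\infty$$

The first equality applies the definition of $T$. The second equality applies the
definition of infinity norm on a vector of sets, and uses the fact that the convex hull
operator on a set won’t increase the infinity norm. The first inequality uses the fact
that the max of a sum is not greater than the sum of the maxes. The last step
uses the definition of $R_M$ and Lemma 1. □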


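Lemma 6 $T(V_M \times \overline{\mathbf{I}}) \subseteq V_M \times \overline{\mathbf{I}}$

Proof:

$$\|T(V_M \times \overline{\mathbf{I}})\|_\infty
  \leq R_M + \gamma V_M
  = R_M + \gamma \frac{R_M}{1-\gamma}
  = \frac{R_M}{1-\gamma}
  = V_M$$

Since $\|T(V_M \times \overline{\mathbf{I}})\|_\infty \leq V_M$, it follows that
$T(V_M \times \overline{\mathbf{I}}) \subseteq V_M \times \overline{\mathbf{I}}$ by the
definition of $\overline{\mathbf{I}}$. □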

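In fact, Lemmas 6 and 3 are sufficient to show convergence. By Lemma 6, as long
as $\overline{\mathbf{V}}$ is initialized to $V_M \times \overline{\mathbf{I}}$, we have

$$T\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}$$

But by Lemma 3 we can apply $T$ to both sides of this relation $k > 0$ times to get

$$T^{k+1}\overline{\mathbf{V}} \subseteq T^{k}\overline{\mathbf{V}}$$

Since $T^k\overline{\mathbf{V}}$ is a monotone non-increasing sequence which cannot
become smaller than the empty set, it must converge.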


A.3 Transition operator with pruning

To take pruning into account, we can define a new transition operator $T_{prune}$.
$T_{prune}$ is like $T$ except that it enforces incentive constraints by intersecting its
backed up values with a fixed set at each state before taking the convex hull. Write
$\overline{\mathbf{G}}_a$ for the vector of $N$ components whose component
$\mathbf{G}_a(s)$ is the pruning set for state $s$ and action $a$,

$$\mathbf{G}_a(s) = \{V \mid (\forall p)\ V_p \geq V_p^{dev}(s, a)\}$$

With this definition, we can write

Definition

$$T_{prune}(\overline{\mathbf{V}}) = CH_{a \in A}\bigl(\overline{\mathbf{G}}_a \cap
  (\overline{R}_a + \gamma P_a \overline{\mathbf{V}})\bigr) \qquad (5)$$

where intersection between vectors of sets is defined to operate on each component
separately.
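Membership in the pruning set is a componentwise comparison against the deviation values. The sketch below shows this test on a finite list of backed-up value vectors; it is only an illustration, since filtering a finite list is not the same as intersecting a full convex set with $\mathbf{G}_a(s)$, and the function names and the tuple `v_dev` holding the deviation values for one state-action pair are assumptions.

```python
def in_pruning_set(v, v_dev):
    """True if every player's value is at least her deviation value."""
    return all(vp >= dp for vp, dp in zip(v, v_dev))


def prune(candidates, v_dev):
    """Keep the candidate value vectors that no player wants to deviate from."""
    return {v for v in candidates if in_pruning_set(v, v_dev)}
```

For example, with deviation values `(2, 1)`, the vector `(1, 4)` is discarded because player 1 would gain by deviating, even though player 2 would not.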


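Lemma 7 If $\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}'$, then
$T_{prune}(\overline{\mathbf{V}}) \subseteq T_{prune}(\overline{\mathbf{V}}')$

Proof:

$$\begin{aligned}
\overline{\mathbf{V}} \subseteq \overline{\mathbf{V}}'
  &\Rightarrow \forall a,\ \gamma P_a \overline{\mathbf{V}} \subseteq \gamma P_a \overline{\mathbf{V}}' \\
  &\Rightarrow \forall a,\ \overline{R}_a + \gamma P_a \overline{\mathbf{V}} \subseteq \overline{R}_a + \gamma P_a \overline{\mathbf{V}}' \\
  &\Rightarrow \forall a,\ \overline{\mathbf{G}}_a \cap (\overline{R}_a + \gamma P_a \overline{\mathbf{V}}) \subseteq \overline{\mathbf{G}}_a \cap (\overline{R}_a + \gamma P_a \overline{\mathbf{V}}') \\
  &\Rightarrow CH_{a \in A}(\overline{\mathbf{G}}_a \cap (\overline{R}_a + \gamma P_a \overline{\mathbf{V}})) \subseteq CH_{a \in A}(\overline{\mathbf{G}}_a \cap (\overline{R}_a + \gamma P_a \overline{\mathbf{V}}')) \\
  &\Rightarrow T_{prune}(\overline{\mathbf{V}}) \subseteq T_{prune}(\overline{\mathbf{V}}')
\end{aligned}$$

The first implication applies Lemma 2 and the second uses the fact that multiplying
by a positive scalar and adding a vector preserve containment properties. The next two
implications use the fact that intersection with a fixed set and the convex hull operator
preserve containment properties, and the last implication applies the definition of
$T_{prune}$. □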


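Lemma 8 $T_{prune}(V_M \times \overline{\mathbf{I}}) \subseteq V_M \times \overline{\mathbf{I}}$

Proof:

$$T_{prune}(V_M \times \overline{\mathbf{I}})
  = CH_{a \in A}\bigl(\overline{\mathbf{G}}_a \cap (\overline{R}_a + \gamma P_a(V_M \times \overline{\mathbf{I}}))\bigr)
  \subseteq CH_{a \in A}(\overline{R}_a + \gamma P_a(V_M \times \overline{\mathbf{I}}))
  = T(V_M \times \overline{\mathbf{I}})
  \subseteq V_M \times \overline{\mathbf{I}}$$

The first equality is the definition of $T_{prune}$. The first containment uses the fact
that intersecting with a fixed set can’t make the result any bigger. The second equality
applies the definition of $T$. The last containment is Lemma 6. □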


Lemma 9 The sequence $T^k_{prune}(V_M \times \overline{\mathbf{I}})$ converges as $k$
increases.

Proof: Lemmas 7 and 8 are sufficient for convergence of $T_{prune}$, since together they
make the sequence $T^k_{prune}(V_M \times \overline{\mathbf{I}})$ monotone non-increasing,
and a monotone non-increasing sequence which is bounded below (by the vector of empty
sets) must converge. □


Definition A fixed point of $T_{prune}$ is any $\overline{\mathbf{V}}$ s.t.
$T_{prune}(\overline{\mathbf{V}}) = \overline{\mathbf{V}}$. A maximal fixed point is a
fixed point of $T_{prune}$ that is not a strict subset of any other fixed point.

Lemmas 7 and 9 imply that there will be a unique maximal fixed point: by Lemma
7 (monotonicity) our value function is bounded from below by each fixed point, and by
Lemma 9 (convergence) we eventually converge to a fixed point, which must therefore
contain every other fixed point. Write $\overline{\mathbf{V}}_{fixed}$ for this unique
maximal fixed point.

The point of the value iteration algorithm is to find the maximal fixed point, since
it tells us everything we need to know about equilibria: we show below that this fixed
point contains all the value vectors that are achievable in equilibrium using the given
punishment strategies. More specifically, Lemma 10 guarantees that our initial
$\overline{\mathbf{V}}$ contains the maximal fixed point, which (because of monotonicity)
guarantees that we cannot converge to a non-maximal fixed point. Lemma 11 will tell us
a policy to achieve, in expectation, any discounted value in the fixed point, and
Lemma 12 will use this policy to define a subgame-perfect equilibrium.


Lemma 10

$$\|\overline{\mathbf{V}}\|_\infty > V_M \Rightarrow
  \|T_{prune}(\overline{\mathbf{V}})\|_\infty < \|\overline{\mathbf{V}}\|_\infty$$

So, if $\|\overline{\mathbf{V}}\|_\infty > V_M$ then $\overline{\mathbf{V}}$ cannot be a
fixed point.

Proof: Let $M = \|\overline{\mathbf{V}}\|_\infty$. Then

$$M > V_M \Rightarrow M > \frac{R_M}{1-\gamma} \Rightarrow (1-\gamma) M > R_M \qquad (6)$$

So,

$$\|T_{prune}(\overline{\mathbf{V}})\|_\infty \leq R_M + \gamma M < (1-\gamma) M + \gamma M = M$$

The first inequality comes from Lemma 5, which also holds for $T_{prune}$ since
$T_{prune}(\overline{\mathbf{V}}) \subseteq T(\overline{\mathbf{V}})$ always. The second
inequality holds because of equation (6): $(1-\gamma) M$ is strictly larger than $R_M$. □

Lemma 11 Let $\overline{\mathbf{V}}$ be a fixed point of $T_{prune}$. For any value vector
$V^{goal} \in \mathbf{V}(s)$, there exists a joint policy $\pi(V^{goal}, s)$ which
achieves $V^{goal}$ if the initial state is $s$.

Proof: We will begin by defining $\pi(V, s)$ for all states $s$ and value vectors
$V \in \mathbf{V}(s)$. To motivate our definition we will pretend that, after the first
step, we can achieve any value vector $V' \in \mathbf{V}(s')$ at any state $s'$. We will
then justify our definition by proving that our defined policy does in fact achieve the
target value vector $V$. The proof will be by induction on the number of time steps of
execution.

To define $\pi(V, s)$, we need to specify a distribution over joint actions to take from
$s$, as well as which value vectors we will try to achieve if we wind up at another state
$s'$. Since $V \in \mathbf{V}(s)$ and $\overline{\mathbf{V}}$ is a fixed point, we can
represent $V$ as a convex combination of points from the sets

$$Q_a(s) = \mathbf{G}_a(s) \cap \Bigl(R_a(s) + \gamma \sum_{s'} P_{a,s,s'} \mathbf{V}(s')\Bigr) \qquad (7)$$

for the joint actions $a \in A$. That is, there exists some set of weights $w_a$ (with
$\sum_a w_a = 1$ and $(\forall a)\ w_a \geq 0$), with each $w_a$ corresponding to a single
point $q_a$ in $Q_a(s)$, such that

$$\sum_a w_a q_a = V \qquad (8)$$

(We only need one point from each $Q_a(s)$ since $Q_a(s)$ is convex.) Now, we will choose
a joint action $a$ at random with probabilities $w_a$. By the definition of $Q_a(s)$, we
know that

$$q_a = R_a(s) + \gamma \sum_{s'} P_{a,s,s'} V_{a,s'}$$

for some vectors $V_{a,s'} \in \mathbf{V}(s')$. So, if the game transitions to state $s'$,
we will follow policy $\pi(V_{a,s'}, s')$ to try to achieve $V_{a,s'}$.

Now that we have defined $\pi(V, s)$ for all $s$ and $V \in \mathbf{V}(s)$, we will prove
by induction that following $\pi(V, s)$ for $k$ steps yields an actual expected discounted
value vector $V_k^{actual}(V, s)$ which satisfies

$$\|V_k^{actual}(V, s) - V\|_\infty \leq \gamma^k V_M$$

Taking the limit as $k \to \infty$ then shows that $\pi(V, s)$ achieves $V$ exactly.

Base Case Following any policy for 0 steps from any state $s$ achieves discounted
expected value $V_0^{actual}(V, s) = 0$. So,

$$\|V_0^{actual}(V, s) - V\|_\infty \leq V_M = \gamma^0 V_M$$

because $V \in \mathbf{V}(s)$ and $\|\overline{\mathbf{V}}\|_\infty \leq V_M$ (Lemma 10).

Inductive Case We now know that following $\pi(V, s)$ for $k$ steps starting from state
$s$ yields a value $V_k^{actual}(V, s)$ which satisfies

$$\|V_k^{actual}(V, s) - V\|_\infty \leq \gamma^k V_M \qquad (9)$$

The expected value of following $\pi(V, s)$ for $k + 1$ steps therefore satisfies:

$$\begin{aligned}
\|V_{k+1}^{actual}(V, s) - V\|_\infty
  &= \Bigl\|V_{k+1}^{actual}(V, s) - \sum_a w_a q_a\Bigr\|_\infty \\
  &= \Bigl\|\sum_a w_a \Bigl(R_a(s) + \gamma \sum_{s'} P_{a,s,s'} V_k^{actual}(V_{a,s'}, s')\Bigr)
      - \sum_a w_a \Bigl(R_a(s) + \gamma \sum_{s'} P_{a,s,s'} V_{a,s'}\Bigr)\Bigr\|_\infty \\
  &= \gamma \Bigl\|\sum_a w_a \sum_{s'} P_{a,s,s'} \bigl(V_k^{actual}(V_{a,s'}, s') - V_{a,s'}\bigr)\Bigr\|_\infty \\
  &\leq \gamma \sum_a w_a \sum_{s'} P_{a,s,s'} (\gamma^k V_M) \\
  &= \gamma^{k+1} V_M
\end{aligned}$$

The first two equalities simply plug in the definitions of the two value vectors, where
the $w_a$ are the weights defined in equation (8). The next equality factors and cancels
common terms in the sums. The inequality between the fourth and fifth lines holds
because of the triangle inequality and the inductive hypothesis stated in (9). The final
equality rearranges terms and uses the fact that each row of the transition matrix sums
to 1, as do the weights $w_a$. □
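The bound $\gamma^k V_M$ also indicates how many steps of the constructed policy are needed before its truncated value is within a given tolerance of the target. The helper below is a small illustrative sketch; the function names and the tolerance parameter `tol` are hypothetical and do not appear in the proof.

```python
def horizon_error_bound(gamma, v_max, k):
    """Bound gamma**k * v_max on the distance to the target after k steps."""
    return gamma ** k * v_max


def steps_for_tolerance(gamma, v_max, tol):
    """Smallest k such that gamma**k * v_max <= tol (requires 0 < gamma < 1, tol > 0)."""
    k = 0
    bound = v_max
    while bound > tol:
        bound *= gamma
        k += 1
    return k
```

Because the bound shrinks geometrically, the number of steps grows only logarithmically as the tolerance decreases.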

We’ve demonstrated how to construct a policy that, starting from some state $s$,
comes arbitrarily close to any given value vector $V \in \mathbf{V}(s)$. Now we will show
that an appropriate modification of this policy is an equilibrium.

Lemma 12 Let $\overline{\mathbf{V}}$ be a fixed point of $T_{prune}$. Any value vector
$V \in \mathbf{V}(s)$ is achievable in subgame-perfect Nash equilibrium starting from
state $s$.

Proof: We have already shown (in Lemma 11) that a joint policy exists to achieve
$V$ from $s$. We can extend this policy in a simple way to make it an equilibrium:
if the agents observe a deviation by player $p$, they will punish it by switching to
the policy $\pi_p^{dev}$. (For concreteness, if two or more agents deviate simultaneously,
the agents will pick at random one of the deviations to punish.) Our assumptions of public
randomization and perfect monitoring mean that the agents always know what they
and everyone else are supposed to do, and no agent can deviate without being caught.

All that remains is to show that, with the above threats, no agent ever wants to
deviate. Since we have assumed that $\pi_p^{dev}$ is a subgame-perfect equilibrium for
each $p$, we only need to worry about the first deviation: there cannot be an incentive to
deviate again in any subgame in which some agent has already deviated once.

To see whether an agent can ever have an incentive to deviate first, consider its
state of knowledge immediately before acting: it knows the current state $s$ and the
joint action $a$ which was selected by public randomization. It also knows a vector
$q_a \in Q_a(s)$ and vectors $V_{a,s'} \in \mathbf{V}(s')$ for each $s'$; these vectors
satisfy

$$q_a = R_a(s) + \gamma \sum_{s'} P_{a,s,s'} V_{a,s'}$$

If agent $p$ deviates, it will receive $V_p^{dev}(s, a)$ for the best possible deviation.
On the other hand, if agent $p$ does not deviate, it expects to receive its component
$(q_a)_p$ of $q_a$: it will get $R_a$ immediately and $V_{a,s'}$ after one step, for $s'$
chosen according to $P_{a,s,s'}$. But by the definition of $Q_a(s)$ (Eq. 7), we know
$Q_a(s) \subseteq \mathbf{G}_a(s)$. In particular, $(q_a)_p \geq V_p^{dev}(s, a)$, so
agent $p$ gets at least as much by following its part of action $a$ as by deviating. □